Univariate data analysis

Tables, Boxplot & Histogram in Python

Describing Data

Summary statistics, such as the mean, minimum, and maximum of each variable, are a quick way to get a feel for the scale of the variables and to spot which ones may be the most important.

Statistical summary for numeric data

Statistical summary for categorical or string variables

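For categorical or string variables, the summary shows "count" (number of non-missing values), "unique" (number of distinct values), "top" (the most frequent value), and "freq" (how often the top value occurs).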
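As a minimal sketch of both summaries, the snippet below uses a tiny made-up DataFrame in place of the housing data. The values are illustrative only, and the `CentralAir` column name is an assumption; `describe` is the standard pandas method.

```python
import pandas as pd

# Tiny made-up sample standing in for the housing data (values are illustrative).
df = pd.DataFrame({
    "SalePrice": [208500, 181500, 223500, 140000, 250000, 143000],
    "GrLivArea": [1710, 1262, 1786, 1717, 2198, 1362],
    "BldgType": ["1Fam", "1Fam", "1Fam", "Duplex", "1Fam", "TwnhsE"],
    "CentralAir": ["Y", "Y", "Y", "N", "Y", "Y"],  # assumed column name
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Non-numeric columns: count, unique, top, freq.
categorical_summary = df.describe(exclude=["number"])

print(numeric_summary)
print(categorical_summary)
```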

Histogram

Plot a histogram of the sale price of all the houses in the data.

Boxplot

Plot a boxplot of the sale price of all the houses in the data. Boxplots do not show the shape of the distribution in as much detail as histograms, but they give a clearer picture of its center and spread, and of any potential outliers. Boxplots and histograms often complement each other and together help us understand more about the data.
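One possible way to draw the two plots side by side is sketched below with matplotlib. The prices are made up for illustration, and the bin count and figure size are arbitrary choices rather than the settings used for the original figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sale prices standing in for the SalePrice column.
prices = pd.Series(
    [105000, 129000, 140000, 143000, 163000, 181500, 208500, 223500, 250000, 410000],
    name="SalePrice",
)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the shape of the distribution.
counts, edges, _ = ax_hist.hist(prices.to_numpy(), bins=5)
ax_hist.set_xlabel("SalePrice")
ax_hist.set_ylabel("Number of houses")

# Boxplot: shows the median, the quartiles, and potential outliers.
ax_box.boxplot(prices.to_numpy())
ax_box.set_ylabel("SalePrice")

fig.tight_layout()
```

In an interactive session, calling `plt.show()` afterwards displays the figure.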

Histograms and Boxplots by Groups

By plotting by groups, we can see how one variable changes in response to another. For example, is there a difference in sale price between houses with and without central air conditioning? Does the sale price vary with the size of the garage?

Boxplot and histogram of house sale price grouped by central air conditioning

Clearly, the mean and median sale prices of houses without air conditioning are much lower than those of houses with air conditioning.
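The group comparison can also be checked numerically. The sketch below uses made-up prices and assumes the air conditioning flag lives in a column named `CentralAir` with values "Y" and "N". The same pattern applies to any other grouping column, such as garage size.

```python
import pandas as pd

# Made-up data; CentralAir is an assumed column name with values "Y"/"N".
df = pd.DataFrame({
    "SalePrice": [100000, 120000, 90000, 200000, 180000, 220000, 260000],
    "CentralAir": ["N", "N", "N", "Y", "Y", "Y", "Y"],
})

# Mean and median sale price within each air conditioning group.
by_air = df.groupby("CentralAir")["SalePrice"].agg(["mean", "median"])
print(by_air)
```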

Boxplot and histogram of house sale price grouped by garage size

The larger the garage, the higher the median house price, but only up to 3-car garages. Houses with 3-car garages have the highest median price, even higher than houses with 4-car garages.

Histogram of sale price for houses with no garage

Histogram of sale price for houses with a 1-car garage

Histogram of sale price for houses with a 2-car garage

Histogram of sale price for houses with a 3-car garage

Histogram of sale price for houses with a 4-car garage

Frequency Table

Frequency tells us how often each value occurs. Frequency tables give us a snapshot of the data and make patterns easier to spot.

Overall Quality frequency table

Garage Size frequency table

Central Air Conditioning frequency table
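A frequency table of this kind can be built with `value_counts`. The sketch below uses a made-up garage size column; `GarageCars` is an assumed column name.

```python
import pandas as pd

# Made-up garage sizes (number of cars); GarageCars is an assumed column name.
df = pd.DataFrame({"GarageCars": [2, 2, 1, 3, 2, 0, 1, 2]})

# Absolute frequencies, ordered by garage size rather than by count.
counts = df["GarageCars"].value_counts().sort_index()

# Relative frequencies (proportions that sum to 1).
shares = df["GarageCars"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```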

Numerical Summaries

A quick way to get a set of numerical summaries for a quantitative variable is with the describe method.

We can also calculate individual summary statistics of SalePrice.
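For instance, individual statistics could be computed as in the sketch below, which uses a made-up price series in place of the real SalePrice column.

```python
import pandas as pd

# Made-up sale prices standing in for df["SalePrice"].
prices = pd.Series([100000, 150000, 200000, 250000, 300000], name="SalePrice")

mean_price = prices.mean()
median_price = prices.median()
q25 = prices.quantile(0.25)  # 25th percentile
q75 = prices.quantile(0.75)  # 75th percentile
iqr = q75 - q25              # interquartile range

print(mean_price, median_price, q25, q75, iqr)
```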

Calculate the proportion of the houses with sale price between 25th percentile (129975) and 75th percentile (214000).

Calculate the proportion of the houses with total square feet of basement area between 25th percentile (795.75) and 75th percentile (1298.25).

Next, we calculate the proportion of houses that meet either condition. Since some houses satisfy both criteria, this proportion is less than the sum of the two proportions calculated above.
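The relationship between these proportions can be sketched as follows. The rows are made up, `TotalBsmtSF` is an assumed column name, and the thresholds are the percentiles quoted above. Note that `between` includes both endpoints by default, which may differ from the comparison used in the original analysis.

```python
import pandas as pd

# Made-up rows; TotalBsmtSF is an assumed column name for basement area.
df = pd.DataFrame({
    "SalePrice": [100000, 130000, 150000, 180000, 210000, 250000, 300000, 160000],
    "TotalBsmtSF": [600, 800, 1500, 1000, 700, 1200, 1400, 900],
})

# Boolean masks; taking the mean of a mask gives the proportion of True values.
price_mid = df["SalePrice"].between(129975, 214000)
bsmt_mid = df["TotalBsmtSF"].between(795.75, 1298.25)

p_price = price_mid.mean()
p_bsmt = bsmt_mid.mean()
p_both = (price_mid & bsmt_mid).mean()
p_either = (price_mid | bsmt_mid).mean()

# Houses meeting both criteria are counted once in p_either,
# so p_either = p_price + p_bsmt - p_both.
print(p_price, p_bsmt, p_both, p_either)
```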

Calculate the interquartile range (IQR) of sale price for houses with no air conditioning.

Calculate the IQR of sale price for houses with air conditioning.
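Both IQRs can be obtained in one step by grouping. This is a sketch on made-up data; `CentralAir` is an assumed column name and `iqr` is a small helper defined here for illustration.

```python
import pandas as pd

# Made-up data; CentralAir is an assumed column name with values "Y"/"N".
df = pd.DataFrame({
    "SalePrice": [80000, 100000, 120000, 140000, 160000,
                  150000, 200000, 250000, 300000, 350000],
    "CentralAir": ["N"] * 5 + ["Y"] * 5,
})

def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# One IQR per air conditioning group.
iqr_by_air = df.groupby("CentralAir")["SalePrice"].agg(iqr)
print(iqr_by_air)
```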

Stratification

Another way to get more information out of a dataset is to divide it into smaller, more uniform subsets, and analyze each of these "strata" on its own. We will create a new HouseAge column, then partition the data into HouseAge strata, and construct side-by-side boxplots of the sale price within each stratum.
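A minimal sketch of this stratification is shown below. It assumes that house age is the year sold minus the year built (columns `YrSold` and `YearBuilt`), and the age bands are hypothetical; the original analysis may define both differently.

```python
import pandas as pd

# Made-up rows; YearBuilt and YrSold are assumed column names.
df = pd.DataFrame({
    "YearBuilt": [2005, 1995, 1960, 1935, 1915, 1890],
    "YrSold": [2008, 2009, 2007, 2010, 2006, 2008],
    "SalePrice": [250000, 210000, 150000, 120000, 100000, 130000],
})

# Assumption: age of the house at the time of sale.
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]

# Hypothetical age bands; each interval is open on the left, closed on the right.
df["AgeGroup"] = pd.cut(df["HouseAge"], bins=[0, 40, 60, 80, 100, 150])

# Median sale price within each stratum.
median_by_age = df.groupby("AgeGroup", observed=True)["SalePrice"].median()
print(median_by_age)
```

The `AgeGroup` column can then serve as the grouping variable for side-by-side boxplots.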

The older the house, the lower the median price; that is, house price tends to decrease with age until the house reaches 100 years old. The median price of houses over 100 years old is higher than the median price of houses aged between 80 and 100 years.

We learned earlier that house prices tend to differ between houses with and without air conditioning. From the graph above, we also find that recent houses (9-40 years old) are all equipped with air conditioning.

We now group first by air conditioning, and then by age band within each air conditioning group. Each approach highlights a different aspect of the data.

We can also stratify jointly by house age and air conditioning to explore how building type varies by both of these factors simultaneously.

For all age groups, the vast majority of dwellings in the data are of type 1Fam. The older the house, the more likely it is to have no air conditioning. However, a 1Fam house over 100 years old is slightly more likely to have air conditioning than not. There are neither very new nor very old duplex houses. A 40-60 year old duplex is more likely to have no air conditioning.

Multivariate Analysis

Scatter plot

A scatterplot is a very common and easily understood visualization of quantitative bivariate data. Below we make a scatterplot of Sale Price against above-ground living area in square feet. The relationship appears to be roughly linear.

2D Density Jointplot

In the following plot, the two margins show the densities of Sale Price and above-ground living area separately, while the center of the plot shows their joint density.

Heterogeneity and stratification

We continue exploring the relationship between SalePrice and GrLivArea, stratifying by BldgType.

In almost all building types, SalePrice and GrLivArea show a positive linear relationship. In the results below, we see that the correlation between SalePrice and GrLivArea is highest for the 1Fam building type at 0.74 and lowest for the Duplex building type at 0.49.
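Per-group correlations of this kind could be computed as in the sketch below. The data are made up, so the resulting values do not match the 0.74 and 0.49 reported above.

```python
import pandas as pd

# Made-up data; the real analysis uses the full housing dataset.
df = pd.DataFrame({
    "BldgType": ["1Fam"] * 4 + ["Duplex"] * 4,
    "GrLivArea": [1000, 1500, 2000, 2500, 1200, 1600, 2000, 2400],
    "SalePrice": [120000, 170000, 230000, 270000, 110000, 140000, 125000, 150000],
})

# Pearson correlation between GrLivArea and SalePrice within each building type.
corr_by_type = {
    bldg_type: group["GrLivArea"].corr(group["SalePrice"])
    for bldg_type, group in df.groupby("BldgType")
}
print(corr_by_type)
```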

Categorical bivariate analysis

We create a contingency table, counting the number of houses in each cell defined by a combination of building type and the general zoning classification.

Below we normalize within rows. This gives us the proportion of houses in each zoning classification that fall into each building type.

We can also normalize within the columns. This gives us the proportion of houses within each building type that fall into each zoning classification.
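The contingency table and both normalizations can be sketched with `pd.crosstab`. The rows are made up, `MSZoning` is an assumed column name for the zoning classification, and zoning is placed on the rows to match the description above.

```python
import pandas as pd

# Made-up rows; MSZoning is an assumed column name for the zoning class.
df = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RL", "RM", "RM", "FV"],
    "BldgType": ["1Fam", "1Fam", "Duplex", "1Fam", "Duplex", "1Fam"],
})

# Raw counts: rows are zoning classes, columns are building types.
counts = pd.crosstab(df["MSZoning"], df["BldgType"])

# Normalize within rows: each zoning class sums to 1 across building types.
row_props = pd.crosstab(df["MSZoning"], df["BldgType"], normalize="index")

# Normalize within columns: each building type sums to 1 across zoning classes.
col_props = pd.crosstab(df["MSZoning"], df["BldgType"], normalize="columns")

print(counts)
print(row_props)
print(col_props)
```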

We can look at the proportion of houses in each zoning class, for each combination of the air conditioning and building type variables.

The largest share of houses in the data have zoning RL, air conditioning, and the 1Fam building type. Among houses with no air conditioning, the largest share have zoning RL and the Duplex building type.
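One way to build such a three-way table is to pass two row variables to `pd.crosstab`, as sketched below. The data are made up, `CentralAir` and `MSZoning` are assumed column names, and normalizing within rows is one reading of the description above; the original analysis may have normalized differently.

```python
import pandas as pd

# Made-up rows; CentralAir and MSZoning are assumed column names.
df = pd.DataFrame({
    "CentralAir": ["Y", "Y", "Y", "N", "N", "Y"],
    "BldgType": ["1Fam", "1Fam", "1Fam", "Duplex", "Duplex", "Duplex"],
    "MSZoning": ["RL", "RL", "RM", "RL", "RM", "RL"],
})

# Rows: each (air conditioning, building type) combination.
# Columns: zoning class. Each row is normalized to sum to 1.
zoning_props = pd.crosstab(
    [df["CentralAir"], df["BldgType"]],
    df["MSZoning"],
    normalize="index",
)
print(zoning_props)
```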

Mixed categorical and quantitative data

Lastly, to get fancier, we draw a violin plot to show the distribution of SalePrice within each building type category.

We can see that the SalePrice distribution of the 1Fam building type is slightly right-skewed, while the SalePrice distributions of the other building types are nearly normal.